# check and install needed packages
packages.used = c('tidytext', 'tidyverse', 'DT', 'wordcloud', 'wordcloud2',
                  'htmlwidgets', 'plotly', 'RColorBrewer', 'sentimentr',
                  'stringr', 'tm')
packages.needed = setdiff(packages.used,
                          intersect(installed.packages()[, 1],
                                    packages.used))
if (length(packages.needed) > 0) {
  install.packages(packages.needed, dependencies = TRUE)
}
# load packages
library(stringr)
library(tidytext)
library(tidyverse)
library(DT)
library(htmlwidgets)
library(plotly)
library(RColorBrewer)
library(sentimentr)
library(wordcloud)
library(wordcloud2)
library(tm)
source('../lib/functions.R')
#source(): if any lib needs to be sourced
This notebook was prepared with the following environment settings:
print(R.version)
_
platform x86_64-apple-darwin17.0
arch x86_64
os darwin17.0
system x86_64, darwin17.0
status
major 4
minor 0.3
year 2020
month 10
day 10
svn rev 79318
language R
version.string R version 4.0.3 (2020-10-10)
nickname Bunny-Wunnies Freak Out
# read csv
data <- read_csv("../data/philosophy_data.csv")
head(data)
This is what the data looks like. Each row represents a sentence in a title; thus the number of rows represents the total number of sentences written.
Sentences are already tokenized, and the sentence length is already computed in the given data. However, we still need to count the number of tokens in each sentence and add a column n_tokens. The base length() function does not work here because the tokenized_txt column is a character string, not a list.
# add number of tokens
new_data <- data %>%
  mutate(n_tokens = f.word_count(tokenized_txt))
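f.word_count() is sourced from ../lib/functions.R and is not shown above. A minimal sketch of such a helper, assuming tokenized_txt stores each token list as a single string of quoted tokens (e.g. "['being', 'and', 'time']" -- this format is an assumption, not confirmed by the source), might be:

```r
library(stringr)

# Hypothetical re-implementation of f.word_count():
# count tokens in a stringified token list by counting quoted items.
f.word_count <- function(tokenized_txt) {
  str_count(tokenized_txt, "'[^']+'")
}

f.word_count("['being', 'and', 'time']")  # 3
```

Because str_count() is vectorized, this works directly on the whole column inside mutate().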
# into a format that can be used to plot
tidy_df <- data %>%
dplyr::select(school, author, title) %>%
group_by(school, author) %>%
summarize(n_title = n_distinct(title),
n_sent = n()) %>%
pivot_longer(cols = c("n_title", "n_sent"),
names_to = "type",
values_to = "count")
Before doing a text analysis, we want to get some basic information about the data. Here are some basic questions:
I. Number of Titles and Sentences
Some authors write at length, while others are succinct. Is there a preference or tendency in the length of works depending on the school or title?
How many titles and sentences per author? Which author has the most manuscripts?
g2 <- tidy_df %>% filter(type== "n_sent") %>%
ggplot(aes(log10(count)))+
geom_histogram(binwidth = .1, color= "black")+
geom_density()+
labs(title = "Number of sentences for each title",
x= "log10(number of sentences)",
y="Frequency")
g2
The counts do not follow a well-known probability distribution. The global maximum (mode) is at around 10^4.1 sentences.
In other words, do more titles mean more sentences?
Through this side-by-side comparison, we see that the ordering differs depending on whether we sort by number of titles or by number of sentences. Thus more titles do not necessarily mean more sentences. For better visualization and easier comparison:
Cleveland Dot Plot
This shows more clearly that a larger number of sentences does not automatically mean more titles. So the works are not all of similar length; some are shorter than others and some are longer.
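The code for the dot plot is not shown in this chunk. A sketch of a Cleveland dot plot built from the tidy_df data frame constructed earlier (authors on the y-axis, counts on a log scale, one point per measure) might look like:

```r
library(tidyverse)

# Cleveland dot plot: one row per author, one point each for
# n_title and n_sent, with counts on a log10 x-axis so both
# measures are readable on the same scale.
g.dot <- tidy_df %>%
  ggplot(aes(x = count, y = fct_reorder(author, count, .fun = max))) +
  geom_point(aes(color = type), size = 2) +
  scale_x_log10() +
  labs(title = "Titles vs. sentences per author",
       x = "Count (log10 scale)", y = "Author", color = "Measure")
g.dot
```

Sorting the authors with fct_reorder() makes the rank disagreement between the two measures visible at a glance.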
g5 <- data %>%
dplyr::select(school, author, title) %>%
ggplot()+
geom_bar(aes(fct_infreq(title), fill=school))+
scale_x_discrete(label = function(x) abbreviate(x, minlength = 7))+
theme(axis.text.x = element_text(angle=90, hjust = 1, vjust=0.5))
ggplotly(g5)
If you hover over the graph, you can isolate each school, similar to faceting. We can see that the number of sentences per title does not depend on the school. For example, among schools with multiple titles, such as analytic and german_idealism, there is no distinct pattern in title length by school.
Since some authors have only one title, for simplicity let's filter for authors with multiple works.
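The filtering code is not shown here; a sketch of how such a subset (the mlt_pub data frame used further below) might be built:

```r
library(tidyverse)

# Keep only authors who have more than one distinct title in the data.
mlt_pub <- data %>%
  group_by(author) %>%
  filter(n_distinct(title) > 1) %>%
  ungroup()
```

Using a grouped filter() keeps every sentence row for qualifying authors, rather than collapsing the data as summarize() would.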
Again, hovering over the plot above, we can see that except for a few authors, there is no clear pattern as to whether an author writes short or long works. The horizontal line represents the mean number of sentences across all authors. Isolating the data for Descartes, we can see that both of his works in the data have very low sentence counts, whereas Heidegger's works tend to be above average. Also, notice that Marx's works appear at both ends of the graph.
II. Sentence Length
Total sentence length per title
Average sentence length per title
length_info <-summary(data$sentence_length)
length_info
Min. 1st Qu. Median Mean 3rd Qu. Max.
20.0 75.0 127.0 150.8 199.0 2649.0
g.len <- new_data %>%
dplyr::select(sentence_length) %>%
ggplot(aes(log10(sentence_length)))+
geom_histogram(bins = 50, color="black")
g.len
token_info <- summary(new_data$n_tokens)
token_info
Min. 1st Qu. Median Mean 3rd Qu. Max.
0.00 13.00 22.00 25.69 34.00 398.00
g.token <- new_data %>%
dplyr::select(n_tokens) %>%
ggplot(aes(log10(n_tokens)))+
geom_histogram(bins = 50, color= "black")
g.token
Which sentences have 0 tokens?
id <- which(new_data$n_tokens == 0)
new_data[id,c("sentence_str", "tokenized_txt")]
# filter out rows that hold sentences without meaning
new_data <- new_data %>%
filter(n_tokens >0)
summary(new_data$n_tokens)
corr_df <- title.per.school %>%
left_join(sent.len.df, by = c("school", "author", "title")) %>%
ungroup() %>%
dplyr::select(title, n_sent, n_tokens)
g.cor <- ggplot(corr_df)+
geom_point(aes(log10(n_sent), log10(n_tokens)))
g.cor
The scatter plot seems to show some correlation between the two variables.
Statistically,
shapiro.test(corr_df$n_sent)
shapiro.test(corr_df$n_tokens)
We cannot use the Pearson correlation between these two variables, since neither passes the Shapiro-Wilk test of normality. Instead, we compute the Kendall correlation, which measures association based on ranks.
result <- cor.test(corr_df$n_sent, corr_df$n_tokens, method = "kendall")
result
Although we cannot conclusively say that longer works use longer sentences, as that would be too much of a stretch, we can say that the number of tokens scales with the number of sentences: the works are not composed disproportionately of very short or very long sentences.
Many analyses have been done on how many distinct words authors used, or which words they used most often. Instead, I look at authors who have more than one title in this data to see whether their use of words changed over time. Let's take a few authors from the list: Nietzsche, Hegel, and Kant.
niet <- mlt_pub %>% filter(author == "Nietzsche")
hegel <- mlt_pub %>% filter(author == "Hegel")
kant <- mlt_pub %>% filter(author == "Kant")
niet %>% select(title, original_publication_date) %>%
distinct(title, .keep_all=TRUE) %>% arrange(original_publication_date)
niet_txt1 <- niet %>% filter(original_publication_date==1886)
niet_txt2 <- niet %>% filter(original_publication_date==1887)
niet_txt3 <- niet %>% filter(original_publication_date==1888)
par(mfrow= c(1, 3))
create_wc(niet_txt1); create_wc(niet_txt2); create_wc(niet_txt3)
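create_wc() comes from ../lib/functions.R and is not shown in this notebook. A minimal sketch of such a word-cloud helper, assuming it tokenizes the sentence_lowered column, drops English stop words, and plots the most frequent remaining words (the column name and behavior are assumptions), might be:

```r
library(tidytext)
library(tidyverse)
library(wordcloud)
library(RColorBrewer)

# Hypothetical version of create_wc(): tokenize the sentences,
# remove stop words, count word frequencies, and draw a word cloud
# of the most frequent terms.
create_wc <- function(df, max_words = 100) {
  counts <- df %>%
    unnest_tokens(word, sentence_lowered) %>%
    anti_join(stop_words, by = "word") %>%
    count(word, sort = TRUE)
  wordcloud(counts$word, counts$n, max.words = max_words,
            colors = brewer.pal(8, "Dark2"))
}
```

Because wordcloud() draws with base graphics, the par(mfrow = c(1, 3)) call above places the three clouds side by side.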
hegel %>% select(title, original_publication_date) %>%
distinct(title, .keep_all=TRUE) %>% arrange(original_publication_date)
hegel_txt1 <- hegel %>% filter(original_publication_date==1807)
hegel_txt2 <- hegel %>% filter(original_publication_date==1817)
hegel_txt3 <- hegel %>% filter(original_publication_date==1820)
par(mfrow= c(1, 3))
create_wc(hegel_txt1); create_wc(hegel_txt2); create_wc(hegel_txt3)
kant %>% select(title, original_publication_date) %>%
distinct(title, .keep_all=TRUE) %>% arrange(original_publication_date)
kant_txt1 <- kant %>% filter(original_publication_date==1781)
kant_txt2 <- kant %>% filter(original_publication_date==1788)
kant_txt3 <- kant %>% filter(original_publication_date==1790)
par(mfrow= c(1, 3))
create_wc(kant_txt1); create_wc(kant_txt2); create_wc(kant_txt3)
We see both change and continuity: some core terms persist across each author's works, while others shift from title to title.